In this paper, we propose two metrics to compare DNA and protein sequences based on a Poisson model of word occurrences. Instead of comparing the frequencies of all fixed-length words in two sequences, we consider (1) the probability of ‘generating’ one sequence under the Poisson model estimated from the other; (2) the difference between the expression levels of words in the two sequences. Phylogenetic trees of 25 viruses including SARS-CoVs are constructed to illustrate our approach.

1. Introduction

One of the fundamental tasks in bioinformatics is sequence comparison, which is used heavily in database searching, sequence classification, phylogenetic tree reconstruction and detection of regulatory sequences. In most cases, the target sequences are aligned by dynamic programming techniques and the resulting alignment scores are used to calculate a measure of similarity. Meanwhile, especially in recent years, an increasing number of alignment-free methods have emerged [1-4]. In contrast to traditional alignments, these alignment-free methods mostly (i) make few assumptions about the evolutionary model and (ii) impose a light computational load. Because of the first merit, alignment-free methods do not suffer greatly from evolutionary events such as large rearrangements and transposon activity. The second merit makes alignment-free comparisons useful for pre-filtering relevant sequences before alignment algorithms are used to refine the search. This type of heuristic approach is already used in programs like BLAST [5] and FASTA [6]. Additionally, after the completion of many genome projects, alignment-free comparisons have begun to find use in whole genome phylogeny, which poses great computational and theoretical challenges for alignment-based methods.

Sequence comparison based on word statistics may be the most well-developed alignment-free method. Observing that the relative abundances of all dinucleotides are remarkably constant across the genome, Karlin et al. [7-9] proposed the ‘genome signature’ to describe a genome. The ‘signature’ consists of the array of dinucleotide relative abundances $\rho_{xy} = f_{xy}/(f_x f_y)$ extended over all dinucleotides, where $f_x$ is the frequency of nucleotide $x$ and $f_{xy}$ is the frequency of dinucleotide $xy$. In the same manner, a genome signature based on the abundances of k-nucleotides can also be defined. Reinert et al. [10] studied the statistical and probabilistic properties of words in sequences, with emphasis on the derivation of exact distributions and the evaluation of their asymptotic approximations. Word-based comparisons were recently reviewed by Vinga and Almeida [2]. According to their work, biological sequences are first represented as frequency vectors in Euclidean space, and then pairwise distances between these sequences can be defined as the standard Euclidean distance, Mahalanobis distance, linear correlation coefficient or Kullback-Leibler discrepancy between their corresponding vectors. As another powerful tool for sequence analysis, some graphical representations of DNA or protein sequences are also based on statistics of short words [11,12].

In this paper, we propose two distance measures for biological sequences on the basis of word statistics. Instead of comparing the frequencies or relative compositions of each word type in two sequences, we explore two measures in a probabilistic framework. Some basic concepts and our computational methods are introduced in the following section. To illustrate our method, in Section 3, similarity trees of 25 virus genomes are built using some classical distances and our methods.

2. Methods 

A sequence $S$ of length $l$ is defined as a linear succession of symbols from a finite alphabet $\mathcal{A}$ of size $n$. A $k$-word (or $k$-mer, $k$-tuple, etc.) $\omega = \alpha_1\alpha_2\cdots\alpha_k$ is a subsequence of $k$ adjacent letters, $\alpha_i \in \mathcal{A}$, $i = 1,2,\ldots,k$. Obviously, there are a total of $n^k$ possible $k$-words for the alphabet $\mathcal{A}$. The occurrence of $\omega$ (denoted by $N_\omega$) is the number of times it is seen when a window of width $k$ slides once across the sequence, and the frequency $f_\omega$ of this word is obtained by dividing its occurrence by the total number of words (i.e., $f_\omega = N_\omega/(l - k + 1)$). Given a symbol sequence, we can represent it as a point in a high-dimensional Euclidean space by mapping $S$ to the vector of its word counts or frequencies:

$$N(S) = (N_{\omega_1}, N_{\omega_2}, \ldots, N_{\omega_{n^k}}) \quad \text{or} \quad f(S) = (f_{\omega_1}, f_{\omega_2}, \ldots, f_{\omega_{n^k}}).$$

For DNA sequences, $\mathcal{A} = \{A, G, C, T\}$. If 2-words are considered and the words in the above vectors are arranged as (AA, AG, AC, AT, GA, GG, ..., TT), the corresponding vectors for S = AAAGGA are

N(S) = (2,1,0,0,1,1,0,...,0) and
f(S) = (0.4,0.2,0,0,0.2,0.2,0,...,0).
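
The counting step described above can be sketched in Python. This is a minimal illustration rather than the authors' implementation; the function names `word_counts` and `word_frequencies` and the default alphabet order are assumptions, chosen so that the words come out in the order (AA, AG, AC, AT, GA, GG, ..., TT) used in the example.

```python
from itertools import product


def word_counts(seq, k, alphabet="AGCT"):
    """Count overlapping k-words by sliding a window of width k across seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than the word length k")
    # One entry per possible word, so that absent words are reported as 0.
    counts = {"".join(letters): 0 for letters in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows with symbols outside the alphabet are skipped
            counts[word] += 1
    return counts


def word_frequencies(seq, k, alphabet="AGCT"):
    """Divide each count by the total number of windows, l - k + 1."""
    total = len(seq) - k + 1
    return {w: c / total for w, c in word_counts(seq, k, alphabet).items()}


counts = word_counts("AAAGGA", 2)
freqs = word_frequencies("AAAGGA", 2)
first_six = ["AA", "AG", "AC", "AT", "GA", "GG"]
print([counts[w] for w in first_six])  # [2, 1, 0, 0, 1, 1]
print([freqs[w] for w in first_six])   # [0.4, 0.2, 0.0, 0.0, 0.2, 0.2]
```

The dictionary holds all $n^k$ words, which is convenient for small k but grows quickly; a sparse counter would be a reasonable alternative for long words.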

To evaluate the distance between two sequences, it is intuitive to compute the norm of the difference between their corresponding frequency (or occurrence) vectors,

$$\| f(S_1) - f(S_2) \|_p = \Big( \sum_{i=1}^{n^k} | f_{1,\omega_i} - f_{2,\omega_i} |^p \Big)^{1/p},$$

where $f_{1,\omega_i}$ and $f_{2,\omega_i}$ are the frequencies of the word $\omega_i$ in sequences $S_1$ and $S_2$, respectively. The norm gives mathematically well defined distance functions for all $p \geq 1$. Here p = 1 gives the Manhattan distance, which was used in [7,13]; p = 2 gives the Euclidean distance [14]; $p = \infty$ gives the max-norm (where only the largest absolute value contributes). However, these simple distances are not satisfactory for an accurate phylogeny, because (i) they treat all word types equally, even though the words have different background frequencies, and (ii) the contribution of a word may not merely be a polynomial function of the frequency difference. In order to overcome the above problems, the Mahalanobis and standard Euclidean distances, which take into account the data covariance structure, were proposed for sequence comparison relatively recently [15]. In this paper, we propose two distance measurements free of such problems by using a probabilistic framework.
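
For concreteness, the norm-based distances discussed above can be computed as in the following sketch. The helper name `minkowski_distance`, the dictionary representation of frequency vectors, and the toy frequency values are assumptions made for illustration only.

```python
import math


def minkowski_distance(f1, f2, p=2.0):
    """p-norm of the difference between two word-frequency vectors.

    f1 and f2 map each word to its frequency; a word missing from one
    dictionary is treated as having frequency 0.
    """
    words = set(f1) | set(f2)
    diffs = [abs(f1.get(w, 0.0) - f2.get(w, 0.0)) for w in words]
    if math.isinf(p):
        return max(diffs)  # max-norm: only the largest difference counts
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Toy frequency vectors (hypothetical values, not taken from real genomes).
f_s1 = {"AA": 0.4, "AG": 0.2, "GA": 0.2, "GG": 0.2}
f_s2 = {"AA": 0.2, "AG": 0.2, "GA": 0.2, "GG": 0.4}
manhattan = minkowski_distance(f_s1, f_s2, p=1)
euclidean = minkowski_distance(f_s1, f_s2, p=2)
max_norm = minkowski_distance(f_s1, f_s2, p=math.inf)
```

Note that every word enters the sum with the same weight, which is exactly limitation (i) mentioned in the text.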

The most immediate model for word occurrences is the binomial distribution, i.e., each word $\omega$ has the same probability $p$ of appearing at any word location. When $p$ is very small, the sequence length $l$ is sufficiently large, and the value of $lp$ is moderate, the occurrences of $\omega$ in this sequence approximately follow the Poisson distribution with parameter $lp$. In what follows we explore two distance metrics on the basis of the Poisson distribution of word occurrences.
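
The quality of the Poisson approximation to the binomial model can be checked numerically. The sketch below uses toy values (1000 word locations and p = 0.002, so that lp = 2); these numbers and the function names are assumptions chosen for illustration, not values from the paper.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k occurrences in n independent word locations."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability of k occurrences with parameter lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


n_positions, p_word = 1000, 0.002  # toy values: lp = 2
largest_gap = max(
    abs(binomial_pmf(k, n_positions, p_word) - poisson_pmf(k, n_positions * p_word))
    for k in range(11)
)
print(largest_gap < 1e-3)  # True
```

For these values the two distributions agree to about three decimal places at every count from 0 to 10.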

2.2. The relative Poisson distance 

For simplicity, we assume that $S_1$ and $S_2$ have the same length $l$ (otherwise we can normalize one of them). The occurrences of word $\omega_i$ in these two sequences are denoted by $N_{1,\omega_i}$ and $N_{2,\omega_i}$, respectively. In the first step, we use $S_1$ to estimate the Poisson parameter. Since the parameter $\lambda$ of the Poisson model equals the expectation of the variable (the word occurrence, in our model), we set $\lambda = N_{1,\omega_i}$. Then define

$$RP_{\omega_i}(S_1,S_2) = \mathrm{Poi}(N_{2,\omega_i}; N_{1,\omega_i}),$$

where $\mathrm{Poi}(k;\lambda)$ is the Poisson probability with parameter $\lambda$,

$$\mathrm{Poi}(k;\lambda) = \frac{\lambda^k e^{-\lambda}}{k!}. \qquad (1)$$



In effect, $RP_{\omega_i}(S_1,S_2)$ measures a kind of ‘similarity’ between $S_1$ and $S_2$ in terms of the occurrences of $\omega_i$ (note that it is not a strict similarity measure, as it is not symmetric). Explicitly, low values of $RP_{\omega_i}$ correspond to relatively large discrepancies in the occurrences of the word $\omega_i$, and the maximum value is attained when $N_{1,\omega_i} = N_{2,\omega_i}$ or $N_{1,\omega_i} = N_{2,\omega_i} + 1$. Taking all words into consideration, the final distance between $S_1$ and $S_2$ is defined as

$$d_{RP}(S_1,S_2) = \sum_{i=1}^{n^k} \big( RP_{\omega_i}(S_1,S_1) + RP_{\omega_i}(S_2,S_2) - RP_{\omega_i}(S_1,S_2) - RP_{\omega_i}(S_2,S_1) \big). \qquad (2)$$

Here the two terms $RP_{\omega_i}(S_1,S_1)$ and $RP_{\omega_i}(S_2,S_2)$ are introduced to guarantee the non-negativity of $d_{RP}(S_1,S_2)$ (note that $RP_{\omega_i}(S_1,S_1) \geq RP_{\omega_i}(S_1,S_2)$ for any word $\omega_i$).

Since $RP_{\omega_i}(S_1,S_2)$ measures the probability of observing $N_{2,\omega_i}$ occurrences of $\omega_i$ in sequence $S_2$ given that the average occurrence is $N_{1,\omega_i}$, we refer to $d_{RP}$ as the Relative Poisson distance between $S_1$ and $S_2$.
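
A minimal sketch of the Relative Poisson distance is given below, assuming that the word counts of the two sequences are already available as dictionaries over the same set of words and that the sequences have comparable lengths. The function names are hypothetical, and evaluating the Poisson probability in log space with `math.lgamma` is a numerical convenience of this sketch, not part of Eq. (1).

```python
import math


def poisson_pmf(k, lam):
    """Poi(k; lam), evaluated in log space to avoid overflow for large counts."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0  # degenerate case: a word absent from the model sequence
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def relative_poisson_distance(n1, n2):
    """d_RP of Eq. (2); n1 and n2 map each word to its count in S1 and S2."""
    total = 0.0
    for word in n1:
        a, b = n1[word], n2[word]
        total += (poisson_pmf(a, a)     # RP(S1, S1)
                  + poisson_pmf(b, b)   # RP(S2, S2)
                  - poisson_pmf(b, a)   # RP(S1, S2) = Poi(N2; N1)
                  - poisson_pmf(a, b))  # RP(S2, S1) = Poi(N1; N2)
    return total


# Toy counts (hypothetical values).
counts_1 = {"AA": 3, "AG": 1, "GA": 0}
counts_2 = {"AA": 1, "AG": 1, "GA": 2}
d_rp = relative_poisson_distance(counts_1, counts_2)
```

Words with identical counts in both sequences contribute exactly zero, so only words whose occurrences differ increase the distance.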


2.3. The distance based on expression level of words


In the above subsection, we considered only one Poisson model: the parameter of this model is estimated from one sequence, and pairwise similarity is evaluated by the probability of generating the other sequence under this model. In this part, the occurrences of word $\omega_i$ in sequences $S_1$ and $S_2$ follow two different Poisson distributions (with parameters $\lambda_{1,i}$ and $\lambda_{2,i}$, respectively). Define

$$Exp_{1,\omega_i} = \sum_{k=0}^{N_{1,\omega_i}} \mathrm{Poi}(k; \lambda_{1,i}), \qquad (3)$$

where $N_{1,\omega_i}$ is the occurrence of $\omega_i$ in $S_1$. $Exp_{1,\omega_i}$ is the probability of observing at most $N_{1,\omega_i}$ occurrences of $\omega_i$ in sequence $S_1$; $Exp_{2,\omega_i}$ is defined in the same way for $S_2$. Note that a word is called highly expressed if its observed frequency is more than its expected frequency, and lowly expressed otherwise. In this sense, the probability $Exp_{1,\omega_i}$ measures a level of expression: a low value of $Exp_{1,\omega_i}$ corresponds to low expression of the word $\omega_i$, and a large value corresponds to high expression of the word $\omega_i$ in sequence $S_1$. We define the final distance between $S_1$ and $S_2$ as

$$d_{Exp}(S_1,S_2) = \sum_{i=1}^{n^k} \left| Exp_{1,\omega_i} - Exp_{2,\omega_i} \right|. \qquad (4)$$

Now, to compute $d_{Exp}(S_1,S_2)$ we need to determine $\lambda_{1,i}$ and $\lambda_{2,i}$ for each word in each sequence. Note that the Poisson parameter for a word is its expected occurrence, which is obtained by multiplying the expected frequency (or background frequency) by the total number of words. We therefore only need to determine the background frequency of each word. Two approaches are tried. The first approach corresponds to independence of nucleotides in the sequence, i.e., the background frequency of the word $\omega$ is estimated by the product of the corresponding nucleotide frequencies in this sequence,

$$f'_\omega = f_{\alpha_1} f_{\alpha_2} \cdots f_{\alpha_k}, \qquad (5)$$

where $f_{\alpha_i}$ ($i = 1,2,\ldots,k$) is the frequency of the letter $\alpha_i$ in this sequence. An alternative method for estimating the background frequency of a word was proposed by Qi et al. [16], who applied a Markov model of DNA sequences of order $k - 2$. The expected frequency of a word is predicted from the frequencies of the appropriate shorter subwords,

$$f'_\omega = f'_{\alpha_1\alpha_2\cdots\alpha_k} = \frac{f_{\alpha_1\alpha_2\cdots\alpha_{k-1}} \, f_{\alpha_2\alpha_3\cdots\alpha_k}}{f_{\alpha_2\alpha_3\cdots\alpha_{k-1}}}, \qquad (6)$$

where $f_{\alpha_1\alpha_2\cdots\alpha_{k-1}}$ is the frequency of the $(k-1)$-word $\alpha_1\alpha_2\cdots\alpha_{k-1}$ in the corresponding sequence. Then, for each background frequency estimated by Eq. (5) or (6), the corresponding Poisson parameter is

$$\lambda_\omega = f'_\omega \cdot (l - k + 1).$$
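
The two background estimates and the resulting Poisson parameter can be sketched as follows. This is an illustrative implementation under our own naming (`frequencies`, `background_independent`, `background_markov` and `poisson_parameter` are hypothetical helpers); returning 0 when the denominator of Eq. (6) vanishes is an assumption of the sketch.

```python
from collections import Counter


def frequencies(seq, k):
    """Frequencies of the overlapping k-words that occur in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {w: c / total for w, c in counts.items()}


def background_independent(seq, word):
    """Eq. (5): product of the single-letter frequencies of the word."""
    letter_freq = frequencies(seq, 1)
    expected = 1.0
    for letter in word:
        expected *= letter_freq.get(letter, 0.0)
    return expected


def background_markov(seq, word):
    """Eq. (6): Markov model of order k - 2 built from (k-1)- and (k-2)-words."""
    k = len(word)
    if k < 3:
        raise ValueError("this sketch expects words of length k >= 3")
    f_k1 = frequencies(seq, k - 1)
    f_k2 = frequencies(seq, k - 2)
    middle = f_k2.get(word[1:-1], 0.0)
    if middle == 0.0:
        return 0.0  # assumption: undefined ratio is treated as zero background
    return f_k1.get(word[:-1], 0.0) * f_k1.get(word[1:], 0.0) / middle


def poisson_parameter(seq, word, background):
    """Expected occurrence: background frequency times the number of windows."""
    return background(seq, word) * (len(seq) - len(word) + 1)


lam_independent = poisson_parameter("AAAGGA", "AG", background_independent)
lam_markov = poisson_parameter("AAAGGA", "AAG", background_markov)
```

Recomputing the frequency tables on every call keeps the sketch short; for whole genomes one would compute them once and reuse them.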

The distance $d_{Exp}$ has the following properties: (i) $d_{Exp}(S_1,S_2) \geq 0$ and $d_{Exp}(S_1,S_1) = 0$ for any sequences $S_1$ and $S_2$; (ii) background information (or frequencies of shorter words) is incorporated into the measurement; and (iii) words with identical frequency and occurrence in two sequences may still contribute to $d_{Exp}$, since they may have different background frequencies and expression levels.
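
Putting the pieces together, the expression-level distance can be sketched as below. The sketch assumes the independence background of Eq. (5) and a DNA alphabet; the function names are hypothetical, the log-space evaluation of the Poisson terms is our own numerical choice, and a word whose background frequency is zero is assigned expression level 1 by convention.

```python
import math
from collections import Counter
from itertools import product


def poisson_cdf(n, lam):
    """P(X <= n) for X ~ Poisson(lam), summing the terms of Eq. (3) in log space."""
    if lam == 0:
        return 1.0  # all probability mass sits at zero occurrences
    total = 0.0
    for k in range(n + 1):
        total += math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
    return min(total, 1.0)


def expression_levels(seq, k, alphabet="AGCT"):
    """Expression level of every k-word, with lambda from the independence model."""
    n_words = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_words))
    letter_freq = {a: seq.count(a) / len(seq) for a in alphabet}
    levels = {}
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        lam = n_words * math.prod(letter_freq[a] for a in letters)
        levels[word] = poisson_cdf(counts[word], lam)
    return levels


def d_exp(seq1, seq2, k, alphabet="AGCT"):
    """Eq. (4): sum over all words of the absolute difference in expression level."""
    e1 = expression_levels(seq1, k, alphabet)
    e2 = expression_levels(seq2, k, alphabet)
    return sum(abs(e1[w] - e2[w]) for w in e1)


distance = d_exp("AAAGGA", "ACGTACGT", 2)
```

Because each sequence supplies its own background, the two sequences need not have the same length in this formulation.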

When $\lambda$ is large ($\lambda > 50$), however, it is difficult to obtain an accurate Poisson probability from Eq. (1) on personal computers. Explicitly, as $e^{-\lambda}$ is very small and $\lambda^k$ in the numerator is very large, errors may arise if they are multiplied directly. To overcome this difficulty, two practical approximations of the Poisson probability for large $\lambda$ are tried: (i) The Stirling formula. According to the Stirling formula, $k! \approx (k/e)^k \sqrt{2\pi k}$, so $\mathrm{Poi}(k;\lambda) = \lambda^k e^{-\lambda}/k! \approx (\lambda/k)^k e^{k-\lambda}/\sqrt{2\pi k}$. (ii) The Normal approximation of the Poisson distribution. When $\lambda$ is sufficiently large, the Poisson distribution with parameter $\lambda$ can be approximated by the Normal distribution $N(\lambda, \lambda)$.
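
The two approximations can be compared numerically against a reference value. In the sketch below, the reference is computed in log space with `math.lgamma`, which is an addition of ours and not one of the two approximations discussed in the text; the function names and the test values (k = 110, lambda = 100) are likewise assumptions for illustration.

```python
import math


def poisson_reference(k, lam):
    """Poi(k; lam) via logarithms and the log-gamma function (reference value)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_stirling(k, lam):
    """(lam/k)^k * e^(k - lam) / sqrt(2*pi*k); requires k >= 1."""
    log_p = k * (math.log(lam) - math.log(k)) + (k - lam) - 0.5 * math.log(2 * math.pi * k)
    return math.exp(log_p)


def poisson_normal(k, lam):
    """Density of the Normal distribution N(lam, lam) evaluated at k."""
    return math.exp(-(k - lam) ** 2 / (2.0 * lam)) / math.sqrt(2.0 * math.pi * lam)


k_obs, lam = 110, 100.0
reference = poisson_reference(k_obs, lam)
stirling_error = abs(poisson_stirling(k_obs, lam) - reference) / reference
normal_error = abs(poisson_normal(k_obs, lam) - reference) / reference
```

In this toy case the Stirling form stays within a fraction of a percent of the reference, while the Normal density is off by a few percent away from the mean, so the choice between them is a trade-off between accuracy and simplicity.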


3. Application 

3.1. Phylogenetic trees of 25 viruses including SARS-CoVs 

Coronaviruses are the causative agents of a number of mammalian diseases which often have significant economic and health-related consequences [17,18]. On the basis of antigenic cross-reactivity, coronaviruses were originally classified into three groups. Group I and group II contain mammalian viruses (group II coronaviruses contain a hemagglutinin esterase gene homologous to that of Influenza C virus [19]), and group III contains only avian viruses. After the outbreak of severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003, many efforts have been made to identify the position of SARS-CoVs in the coronavirus phylogeny. However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any previously isolated groups and form a new group [20,21]; a maximum likelihood tree built from a fragment of the spike protein favored clustering SARS-CoVs with group II coronaviruses (murine hepatitis virus and rat coronavirus) [22]; while an information-based method, which made use of the whole genome sequences, indicated that the SARS-CoVs should not be classified as a new group but are close to the group I coronaviruses [23].

In this paper, we select 25 complete virus genomes: 12 coronaviruses from the three typical groups, 12 SARS-CoV strains, and a torovirus, which serves as the outgroup for coronaviruses [24] (data are shown in Table 1). In order to validate our method, distance matrices for the same data set are also constructed using some classical dissimilarity measurements, e.g., the standard Euclidean distance [15,25], linear correlation coefficient [26], Kullback-Leibler (KL) discrepancy [3] and the Composition Vector approach [16,27]. Because the Kullback-Leibler discrepancy between two frequency vectors is not symmetric and gives degenerate results when some word types are absent, we use a revised version, the Weighted Sequence Entropy (WSE) [28]. This modification behaves equivalently to the KL discrepancy in the case of short words, and effectively avoids the degeneracy for long words. The string Composition Vector (CV) approach proposed by Hao’s group is a fast and efficient approach to whole genome comparison and phylogenetic analysis. For each k-string $\omega$, define

$$CV_\omega = \begin{cases} \dfrac{f_\omega - f'_\omega}{f'_\omega}, & f'_\omega \neq 0, \\ 0, & f'_\omega = 0, \end{cases} \qquad (7)$$

where $f_\omega$ is the frequency of word $\omega$ in a genomic sequence, and $f'_\omega$ is its expected frequency under a certain background model (a Markov model of order $k - 2$). The values $CV_\omega$ for all possible $\omega$ are then collected as components to form a composition vector. The final distance between two species is evaluated based on the cosine function between their corresponding composition vectors.
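
A compact sketch of the Composition Vector comparison is shown below. The function names are hypothetical, and the conversion from cosine to distance, (1 - C)/2, is one common normalization that we assume here; the text above only states that the distance is based on the cosine function.

```python
import math
from collections import Counter
from itertools import product


def frequencies(seq, k):
    """Frequencies of the overlapping k-words that occur in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {w: c / total for w, c in counts.items()}


def composition_vector(seq, k, alphabet="AGCT"):
    """Components of Eq. (7), with the expected frequency from a Markov model of order k - 2."""
    if k < 3:
        raise ValueError("this sketch expects words of length k >= 3")
    f_k, f_k1, f_k2 = frequencies(seq, k), frequencies(seq, k - 1), frequencies(seq, k - 2)
    vector = []
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        middle = f_k2.get(word[1:-1], 0.0)
        expected = f_k1.get(word[:-1], 0.0) * f_k1.get(word[1:], 0.0) / middle if middle else 0.0
        vector.append((f_k.get(word, 0.0) - expected) / expected if expected else 0.0)
    return vector


def cosine_distance(u, v):
    """(1 - cosine) / 2, which lies between 0 and 1 (assumed normalization)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for an all-zero composition vector")
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


cv_1 = composition_vector("AAAGGA", 3)
cv_2 = composition_vector("AGGAGGAAAG", 3)
cv_distance = cosine_distance(cv_1, cv_2)
```

The toy sequences are far too short for a meaningful comparison; they only demonstrate the mechanics of building and comparing the vectors.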

After calculating the pairwise distance matrices, phylogenetic trees for the 25 viruses are built by the UPGMA and NJ programs in the PHYLIP package. Rooted phylogenetic trees are then drawn by the TREEVIEW program [29]. The UPGMA tree built from the standard Euclidean distance is shown in Fig. 1(1). This tree supports torovirus as the outgroup of all coronaviruses, but fails to cluster the three group I coronaviruses: HCoV-229E and PEDV are grouped together, but TGEV is much closer to the SARS clade. Fig. 1(2) is the NJ tree constructed from the Euclidean distance. Similar to the UPGMA tree, this tree also clusters SARS-CoVs with TGEV. An obvious defect, however, is that it does not cluster the eight group II coronaviruses. In Fig. 2, we show the trees built from the linear correlation coefficient between pairwise frequency vectors. Fig. 2(1) is the UPGMA tree. This tree perfectly clusters species within each typical group, and confirms SARS-CoVs paraphyly. But it fails to identify the outgroup status of torovirus relative to coronaviruses. The NJ tree (Fig. 2(2)), in which torovirus is selected as the outgroup species, confirms the adjacent relationship of SARS-CoVs with group I viruses. In the tree built from our distance measure $d_{Exp}$ (Fig. 3), all of the above defects are eliminated, i.e., the species of each typical group cluster together, and torovirus stays outside of all coronaviruses including SARS-CoVs. Our tree shows that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, but does not support the outgroup status of SARS-CoVs relative to other coronaviruses, as proposed by Zheng et al. [30]. This result is mainly in accordance with the WSE tree at word length k = 6 (Fig. 4) and the NJ tree constructed by the Composition Vector method (Fig. 5). Moreover, it is also supported by experimental evidence showing that group I coronavirus-specific antibodies are able to recognize antigens in SARS-CoV infected cultured cells [31].
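
To make the tree-building step concrete, the following sketch shows how UPGMA turns a distance matrix into a rooted topology. It is a simplified stand-in written for illustration and is not the PHYLIP implementation used for the trees in this paper: it returns only the topology as a Newick string without branch lengths, and the labels and distances are toy values.

```python
def upgma(names, dist):
    """Return a Newick string (topology only) from a symmetric distance matrix."""
    # Each cluster is stored as id -> (Newick label, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    pair_dist = {(i, j): dist[i][j]
                 for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(pair_dist, key=pair_dist.get)
        del pair_dist[(a, b)]
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = pair_dist.pop((min(a, c), max(a, c)))
            d_bc = pair_dist.pop((min(b, c), max(b, c)))
            # Size-weighted average distance to the merged cluster.
            pair_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    (newick, _), = clusters.values()
    return newick + ";"


# Toy example with four hypothetical taxa.
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(toy_names, toy_dist)
print(tree)  # (D,(C,(A,B)));
```

UPGMA assumes a constant rate of change along all lineages, which is one reason the NJ trees are reported alongside it.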

Table 1
Coronaviruses and a torovirus used to construct the phylogenetic trees.

| No. | Accession No. | Abbreviation | Genome | Group | Length (nt) |
|-----|---------------|--------------|--------|-------|-------------|
| 1 | NC_002654 | HCoV-229E | Human coronavirus 229E | I | 27317 |
| 2 | NC_002306 | TGEV | Transmissible gastroenteritis virus | I | 28586 |
| 3 | NC_003436 | PEDV | Porcine epidemic diarrhea virus | I | 28033 |
| 4 | U00735 | BCoVM | Bovine coronavirus strain Mebus | II | 31032 |
| 5 | AF391542 | BCoVL | Bovine coronavirus isolate BCoV-LUN | II | 31028 |
| 6 | AF220295 | BCoVQ | Bovine coronavirus strain Quebec | II | 31100 |
| 7 | NC_003045 | BCoV | Bovine coronavirus | II | 31028 |
| 8 | AF208067 | MHVM | Murine hepatitis virus strain ML-10 | II | 31233 |
| 9 | AF201929 | MHV2 | Murine hepatitis virus strain 2 | II | 31276 |
| 10 | AF208066 | MHVP | Murine hepatitis virus strain Penn 97-1 | II | 31112 |
| 11 | NC_001846 | MHV | Murine hepatitis virus | II | 31357 |
| 12 | NC_001451 | IBV | Avian infectious bronchitis virus | III | 27608 |
| 13 | AY278488 | BJ01 | SARS coronavirus BJ01 | - | 29725 |
| 14 | AY278741 | Urbani | SARS coronavirus Urbani | - | 29727 |
| 15 | AY278491 | HKU-39849 | SARS coronavirus HKU-39849 | - | 29742 |
| 16 | AY278554 | CUHK-W1 | SARS coronavirus CUHK-W1 | - | 29736 |
| 17 | AY282752 | CUHK-Su10 | SARS coronavirus CUHK-Su10 | - | 29736 |
| 18 | AY283794 | SIN2500 | SARS coronavirus SIN2500 | - | 29711 |
| 19 | AY283795 | SIN2677 | SARS coronavirus SIN2677 | - | 29705 |
| 20 | AY283796 | SIN2679 | SARS coronavirus SIN2679 | - | 29711 |
| 21 | AY283797 | SIN2748 | SARS coronavirus SIN2748 | - | 29706 |
| 22 | AY283798 | SIN2774 | SARS coronavirus SIN2774 | - | 29711 |
| 23 | AY291451 | TW1 | SARS coronavirus TW1 | - | 29729 |
| 24 | NC_004718 | TOR2 | SARS coronavirus | - | 29751 |
| 25 | X52374 | EToV | Equine torovirus | - | 7920 |

Fig. 1. (1) UPGMA and (2) Neighbor-Joining trees of 25 viruses (k = 6). Pairwise distances are evaluated by the standard Euclidean distance.

3.2. Whole mitochondrial genome phylogeny of 20 Eutherian mammals

In order to further validate our algorithm, we use the complete mtDNA sequences of 20 Eutherian mammals selected by Otu and Sayood as our second dataset [32]. This dataset consists of seven Primates, eight Ferungulates, two Rodents and three non-placental mammals. Their corresponding GenBank Accession Codes are as follows:

• Primates: Human (Homo sapiens, V00662), common chimpanzee (Pan troglodytes, D38116), pygmy chimpanzee (Pan paniscus, D38113), gorilla (Gorilla gorilla, D38114), orangutan (Pongo pygmaeus, D38115), gibbon (Hylobates lar, X99256) and baboon (Papio hamadryas, Y18001).

• Ferungulates: Horse (Equus caballus, X79547), white rhinoceros (Ceratotherium simum, Y07726), harbor seal (Phoca vitulina, X63726), gray seal (Halichoerus grypus, X72004), cat (Felis catus, U20753), fin whale (Balaenoptera physalus, X61145), blue whale (Balaenoptera musculus, X72204) and cow (Bos taurus, V00654).

• Rodents: Rat (Rattus norvegicus, X14848) and mouse (Mus musculus, V00711).

• Non-placental mammals: Opossum (Didelphis virginiana, Z29573), wallaroo (Macropus robustus, Y10524) and platypus (Ornithorhynchus anatinus, X83427).

We applied the proposed distance measurements to the complete mitochondrial genomes listed above. In Fig. 6, we show the UPGMA tree constructed from the distance $d_{Exp}$ with background frequencies estimated by Eq. (6). As seen from this figure, the three main groups of placental mammals, namely Primates, Ferungulates and Rodents, cluster accordingly, and the three non-placental mammals stay outside of all other species. This topology is in perfect agreement with that given by Otu and Sayood except for the position of rodents (mouse and rat). However, the relationship among the three main groups of placental mammals is still a controversial topic in molecular genetics [33]. Different types of molecular data and analysis methods result in different trees. By the maximum likelihood method, some proteins support the Ferungulates (Primates, Rodents) grouping while other proteins support the Rodents (Ferungulates, Primates) grouping [34], whereas our result suggests an alternative topology of Primates (Ferungulates, Rodents). In addition, we also applied some other word-based metrics mentioned above (the standard Euclidean distance, linear correlation coefficient and KL discrepancy) to the same dataset, but they did not give competitive results (not shown in this paper).

Fig. 2. (1) UPGMA and (2) Neighbor-Joining trees of 25 viruses (k = 6). Pairwise distances are evaluated by the linear correlation coefficient.

Fig. 3. Phylogenetic tree built by our distance $d_{Exp}$ at k = 6, where the background frequency of each word is estimated by the product of the corresponding nucleotide frequencies.

Fig. 4. Phylogenetic tree built by the Weighted Sequence Entropy (WSE) at word length k = 6.

Fig. 5. (1) UPGMA and (2) Neighbor-Joining trees of 25 viruses built by the string Composition Vector approach.

Fig. 6. The UPGMA tree built from the complete mtDNA sequences of 20 mammals. We use the distance metric $d_{Exp}$, and background frequencies of words are estimated by the Markov model of order k - 2.

4. Conclusion and discussion 

With the completion of many genome projects of Prokaryotes and Eukaryotes, genome-level phylogeny construction has become feasible and is expected to be more reliable than traditional analyses of only a single gene or a fragment of a genome. However, multiple sequence alignment of genomic sequences is still a bottleneck, first due to the computational time, and second due to the inherent model assumptions. Therefore, there is a great need to develop new sequence comparisons free of these problems. In recent years, a number of alignment-free methods based on, e.g., k-word frequencies [2], graphical representations [35-42], and information content [32,43], have been proposed. Nevertheless, compared to alignment methods, these methods are still at an early stage.

Sequence comparison based on the genomic composition of short words may be the most widely studied alignment-free method. It has relatively low computational complexity, and does not suffer greatly from genetic rearrangements and transposon activity, which are common mechanisms of genome evolution. In most cases, biological sequences are represented as occurrence or frequency vectors in a high-dimensional Euclidean space, and then the standard Euclidean distance, linear correlation coefficient, Kullback-Leibler (KL) discrepancy or cosine function between these vectors is calculated as a measure of dissimilarity. In this paper, we investigate two word-based distance measurements in a probabilistic framework. Our hypothesis is that the occurrence of a given word in a random DNA sequence follows the Poisson distribution. The distance between two sequences is then evaluated by the probability of generating one sequence under the Poisson model estimated from the other, or by the difference between their expression levels of words. In contrast to the traditional word-based distances, which use only the frequencies of fixed-length words, our distances take background information of words (estimated by the frequencies of some shorter words or the corresponding nucleotide composition) into account. In other words, our method has the potential to adjust for background information in distance measurements that use composition vectors. Through constructing phylogenetic trees of 25 viruses including SARS-CoVs and of 20 Eutherian mammals, we find that our method gives more competitive results than existing word-based methods.

We note that each component $CV_\omega$ of the string Composition Vector is also a measure of expression in terms of word $\omega$. In Eq. (7), the numerator $f_\omega - f'_\omega$ is the deviation of the observed frequency from the expected value, and the denominator is introduced to eliminate the size effect. However, unlike our measure (Eq. (3)), the value of $CV_\omega$ may be dominated by words with very low background frequency, i.e., when $f'_\omega$ is very small, the corresponding $CV_\omega$ will be very large. Our measure is free of this problem as it ranges from 0 to 1. In other words, our method can avoid the noise introduced by words with exceptional background frequencies.

However, compared to those word-based measurements which consider only composition vectors, our distances have relatively high computational costs. For example, occurrences of many words are much higher than 60 in some bacterial genomes (when k = 10), which makes our Poisson-based distances computationally infeasible. So a reliable and efficient approximation of the Poisson probability is critical to our method. In addition, the accuracy of our approach depends strongly on the Poisson model of word occurrences. This assumption is generally valid when the sequence length is sufficiently large. But for words with overlapping structure, e.g., TATATA and CCGCCG, the occurrences in a random sequence may deviate significantly from the Poisson distribution. At the same time, experiments have shown that these self-overlapping words are more likely to be functional patterns in regulatory regions of genomes. In future work, we will explore models to describe and compare these words.